Objectives

Data science aims to extract useful information from data. To use data efficiently and correctly, we must first understand the data. Guided by the goal of the study and combined with domain knowledge, we then design the study. In this lecture we first go through some basic exploratory data analysis (EDA) to understand the nature of the data and some plausible relationships among the variables.

Data mining tools have expanded dramatically in the past 20 years. The linear model, a building block of data science, is simple yet powerful. We introduce/review the simple linear model. The focus is to understand what it is we are modeling, how to use the data to extract that information, and how to understand the intrinsic variability that statistics have.

Contents:

  1. Suggested readings:
    • Chapters 2 and 3
    • Statistical Sleuth, Chapter 7/8
    • Data set: MLPayData_Total.csv
  2. Case Study

  3. EDA

  4. Simple regression (a quick review)
    • Model specification
    • OLS estimates and properties
    • R-squared and RSE
    • Confidence intervals for coefficients
    • Prediction intervals
    • Model diagnoses
  5. Appendices

Technical Note:

We hide all the outputs of R chunks by setting results = "hide" globally using knitr::opts_chunk$set() in the first chunk. You can set results = "markup" (the default) or results = "hold" in the global setting to show the outputs of every chunk, or in a chunk header to show the output for that chunk only. Setting results = "hold" holds all the output pieces and pushes them to the end of the chunk.
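For reference, a minimal sketch of such a setup chunk:

```r
# First chunk of the .Rmd: suppress chunk output globally.
# Change "hide" to "markup" or "hold" to display output.
knitr::opts_chunk$set(results = "hide")
```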

DPLYR Cheat Sheet

ggplot Cheat Sheet

1 Case Study: Baseball

Baseball is one of the most popular sports in the US. The highest level of baseball is Major League Baseball (MLB), which comprises the American League and the National League, with a total of 30 teams. The New York Yankees, Boston Red Sox, Philadelphia Phillies, and the recently rising Oakland Athletics are among the top teams. The Oakland A’s are a low-budget team, but they have been moving up mainly thanks to their General Manager (GM) Billy Beane, who is well known for applying statistics to build the team.

Questions of interests for us:

  • Q1: Will a team perform better when they are paid more?
  • Q2: Is Billy Beane (Oakland A’s GM) worth 12.5 million dollars for a period of 5 years, as offered by the Red Sox?

Read an article: Billion dollar Billy Beane.

Data: baseball.csv consists of the winning records and payroll of all 30 MLB teams from 1998 to 2014 (17 years). There are 162 games in each season. We will create the following aggregated variables:

  • payroll: total pay from 1998 to 2014 in billion dollars

  • win: average winning percentage for the span of 1998 to 2014

  • team: keep the team names.

To answer Q1:

  1. How does payroll relate to the performance measured by win?

  2. Given payroll = .841,
  • on average, what would the mean winning percentage be?
  • what do we expect the win to be for such a team?

(Oakland A’s: payroll = .841, win = .54; Red Sox: payroll = 1.97, win = .55)

2 Simple Linear Regression

Often we would like to explore the relationship between two variables. Will a team perform better when they are paid more? The simplest model is a linear model. Let the response \(y_i\) be the win and the explanatory variable \(x_i\) be payroll (\(i = 1, \dots, n=30\)).

Assume there is a linear relationship between win and payroll, i.e.

\[y_i = \beta_0 + \beta_1 x_i + \epsilon_i\]

Model interpretation:

  • We assume that, given payroll, the average win is a linear function of payroll
  • For each team, the win is this average plus an error term
  • Parameters of interest
    • intercept: \(\beta_0\)
    • slope: \(\beta_1\)
    • both are unknown

Estimation

Once we have produced an estimate \((\hat\beta_0, \hat\beta_1)\), we can then

  • Interpret the slope
  • Estimate the mean win and
  • Predict win for a team based on the payroll

\[\hat y_i = \hat\beta_0 + \hat\beta_1 x_i.\]

How to estimate the parameters using the data we have? For example, how would you decide on the following three estimates?

2.1 Ordinary least squares (OLS) estimates

Given a candidate estimate \((b_0, b_1)\), we first define the residuals as the differences between the actual and predicted values of the response \(y\), i.e.

\[\hat{\epsilon}_i = y_i - b_0 - b_1 x_i.\] In the previous plots, the residuals are the vertical distances between the observations and the fitted line. Now we are ready to define the OLS estimate.

The OLS estimates \(\hat\beta_0\) and \(\hat\beta_1\) are obtained by minimizing the residual sum of squares (RSS):

\[(\hat\beta_0, \hat\beta_1) = \arg\min_{b_0,\,b_1} \sum_{i=1}^{n}\hat{\epsilon}_i^2 = \arg\min_{b_0,\,b_1} \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^{2}.\]

We can derive the solution of \(\hat\beta_0\) and \(\hat\beta_1\).

\[\begin{split} \hat\beta_0 &= \bar{y} - \hat\beta_1 \bar{x} \\ \hat\beta_1 &= r_{xy} \cdot \frac{s_y}{s_x} \end{split}\]

where

  • \(\bar{x}=\) sample mean of \(x\)’s (payroll)
  • \(\bar{y}=\) sample mean of \(y\)’s (win)
  • \(s_{x}=\) sample standard deviation of \(x\)’s (payroll)
  • \(s_{y}=\) sample standard deviation of \(y\)’s (win)
  • \(r_{xy}=\) sample correlation between \(x\) and \(y\).
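The closed-form solution can be verified directly in R. A sketch, assuming the data frame datapay (with columns win and payroll) has been loaded:

```r
x <- datapay$payroll
y <- datapay$win

r  <- cor(x, y)                   # sample correlation r_xy
b1 <- r * sd(y) / sd(x)           # slope: r_xy * s_y / s_x
b0 <- mean(y) - b1 * mean(x)      # intercept: ybar - b1 * xbar

c(b0, b1)   # agrees with coef(lm(win ~ payroll, data = datapay))
```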

The following Shiny application shows how the fitted line and the RSS change as we vary \((b_0, b_1)\). (The following chunk is not required; it is only for demonstration. RUN THE CHUNK IN THE CONSOLE to let the R session serve as a server: copy the chunk and paste it into the Console. If you are interested, this step-by-step tutorial will get you started.)

2.1.1 lm()

The function lm() will be used extensively. It solves the minimization problem defined above. Below we use win as the dependent variable \(y\) and payroll as our \(x\).

As we can see from the output below, lm() returns a list of many statistics. We will define these statistics later.

We can also view a summary of the lm() output by using the summary() command.
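The hidden chunks behind the output below are essentially:

```r
myfit0 <- lm(win ~ payroll, data = datapay)  # fit the least squares line
summary(myfit0)                              # coefficient table, RSE, R-squared
names(summary(myfit0))                       # components stored in the summary object
```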

## 
## Call:
## lm(formula = win ~ payroll, data = datapay)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.03932 -0.01356 -0.00134  0.00748  0.06660 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.4175     0.0148   28.23  < 2e-16 ***
## payroll       0.0621     0.0106    5.88  2.5e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0256 on 28 degrees of freedom
## Multiple R-squared:  0.553,  Adjusted R-squared:  0.537 
## F-statistic: 34.6 on 1 and 28 DF,  p-value: 2.52e-06
## 
##  [1] "call"          "terms"         "residuals"     "coefficients" 
##  [5] "aliased"       "sigma"         "df"            "r.squared"    
##  [9] "adj.r.squared" "fstatistic"    "cov.unscaled"

Notice that the outputs of myfit0 and summary(myfit0) are different.

## 
## Call:
## lm(formula = win ~ payroll, data = datapay)
## 
## Coefficients:
## (Intercept)      payroll  
##      0.4175       0.0621

To summarize, the OLS estimates are \(\hat{\beta}_0 =\) 0.417 and \(\hat{\beta}_1 =\) 0.062, and the fitted equation is:

\[\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i = 0.417 + 0.062 \cdot x_i\]

Here are what we can say:

Interpretation of the slope:

  • When payroll increases by 1 unit (1 billion dollars), we expect the win to increase, on average, by about 0.062.

  • When payroll increases by .5 unit (500 million dollars), we expect the win to increase, on average, by about 0.031.

Prediction equation:

  • For all teams similar to the Oakland Athletics, whose payroll is 0.841, we estimate the mean win to be

\[\hat{y}_{\text{Oakland Athletics}} = 0.417 + 0.062 \times 0.841 \approx .470\]

  • Residuals: For the Oakland Athletics, the actual win is 0.539, so the residual for the team is

\[\hat{\epsilon}_{\text{Oakland}}= .539 - .470 = .069\]

Here are a few rows that show the fitted values from our model

Scatter plot with the LS line added

Base R

ggplot

## `geom_smooth()` using formula 'y ~ x'

HERE is how the article concludes that Beane is worth as much as the GM of the Red Sox: in the plot above, Oakland A’s winning percentage is more or less the same as that of the Red Sox, so based on the LS equation, the team “should” have been paid about 2 billion. Do you agree with this argument?

2.2 Goodness of Fit: \(R^2\)

How well does the linear model fit the data? A common, popular notion is through \(R^2\).

Residual Sum of Squares (RSS):

The least squares approach chooses \(\hat\beta_0\) and \(\hat\beta_1\) to minimize the RSS. RSS is defined as:

\[RSS = \sum_{i=1}^{n} \hat{\epsilon}_i^{2} = \sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_i)^{2}\]

## [1] 0.0184

Mean Squared Error (MSE):

Mean Squared Error (MSE) is the average of the squares of the errors, i.e. the average squared difference between the estimated values and the actual values. For simple linear regression, MSE is defined as:

\[MSE = \frac{RSS}{n-2}.\]

Residual Standard Error (RSE) / Root Mean Square Error (RMSE):

Residual Standard Error (RSE) is the square root of MSE. For simple linear regression, RSE is defined as:

\[RSE = \sqrt{MSE} = \sqrt{\frac{RSS}{n-2}}.\]

## [1] 0.0256
## [1] 0.0256

Total Sum of Squares (TSS):

TSS measures the total variance in the response \(Y\), and can be thought of as the amount of variability inherent in the response before the regression is performed. In contrast, RSS measures the amount of variability that is left unexplained after performing the regression.

\[TSS = \sum_{i=1}^{n} (y_i - \bar{y})^{2}\]

## [1] 0.0411

\(R^2\):

\(R^{2}\) measures the proportion of variability in \(Y\) that can be explained using \(X\). An \(R^{2}\) statistic that is close to 1 indicates that a large proportion of the variability in the response has been explained by the regression.

\[R^{2} = \frac{TSS - RSS}{TSS}\]

## [1] 0.553
## [1] 0.553
## [1] 0.553
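All of these quantities can be reproduced from the fitted object. A sketch, assuming myfit0 is the fit of win on payroll from before:

```r
RSS <- sum(residuals(myfit0)^2)                      # residual sum of squares
n   <- length(residuals(myfit0))
MSE <- RSS / (n - 2)                                 # mean squared error
RSE <- sqrt(MSE)                                     # residual standard error
TSS <- sum((datapay$win - mean(datapay$win))^2)      # total sum of squares
R2  <- (TSS - RSS) / TSS                             # same as summary(myfit0)$r.squared
c(RSS = RSS, RSE = RSE, TSS = TSS, R2 = R2)
```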

Remarks:

How large does \(R^2\) need to be before you are comfortable using the linear model? Though \(R^2\) is a very popular notion of goodness of fit, it has its limitations. All of the sums of squared errors defined so far are training errors: they only measure how well the model fits the data used to build it. The model may not generalize well to unseen data.

2.3 Inference

One of the most important aspects of statistics is realizing that the estimators we propose, such as the least squares estimators of the slope and the intercept, change as a function of the data. Understanding the variability of these statistics and quantifying the accuracy of the estimators is a central focus for statisticians.

Recall that we assume a linear model, \[y_i = \beta_0 + \beta_1 x_i + \epsilon_i.\] We did not impose any assumptions on \(\epsilon_i\) when deriving the OLS estimates. In order to provide statistical properties and guarantees for the OLS estimate \((\hat\beta_0, \hat\beta_1)\), we need to impose assumptions.

2.3.1 Linear model assumptions

  • Linearity:

\[\textbf{E}(y_i | x_i) = \beta_0 + \beta_1 x_i\]

  • Homoscedasticity:

\[\textbf{Var}(y_i | x_i) = \sigma^2\]

  • Normality:

\[\epsilon_i \overset{iid}{\sim} \mathcal{N}(0, \sigma^2)\] or

\[y_i \overset{iid}{\sim} \mathcal{N}( \beta_0 + \beta_1 x_i, \sigma^2)\]

2.3.2 Inference for the coefficients: \(\beta_0\) and \(\beta_1\)

Under the model assumptions:

  1. The \(y_i\)’s are independent and normally distributed given \(x_i\)
  2. The mean of \(y\) given \(x\) is linear
  3. The variance of \(y\) does not depend on \(x\)

The OLS estimate \(\hat \beta = (\hat\beta_0, \hat\beta_1)\) has the following properties:

  1. Unbiasedness:

\[\textbf{E}(\hat{\beta}) = \beta\]

  2. Normality:

\[\hat{\beta}_1 \sim \mathcal{N}(\beta_1, \textbf{Var}(\hat{\beta}_1))\] where

\[\textbf{Var}(\hat{\beta}_1) = \frac{\sigma^{2}}{\sum_{i=1}^{n} (x_i - \bar{x})^{2}}.\]

In general, \[\hat\beta \sim \mathcal{N} (\beta,\ \sigma^2 (X^TX)^{-1}).\] Here \(X\) is the design matrix whose first column is all 1’s and whose second column contains the values of \(x\), in our case the payrolls.

Confidence intervals for the coefficients

\(t\)-interval and \(t\)-test can be constructed using the above results.

For example, the 95% confidence interval for \(\beta\) approximately takes the form

\[\hat\beta \pm 2 \cdot SE(\hat\beta).\]

Tests to see if the slope is 0:

We can also perform hypothesis test on the coefficients. To be specific, we have the following test.

\[H_0: \beta_{1}=0 \mbox{ v.s. } H_1: \beta_{1} \not= 0\]

To test the null hypothesis, we need to decide whether \(\hat \beta_1\) is far away from 0, relative to its standard error \(SE(\hat \beta_1)\). We define the test statistic as follows.

\[t = \frac{\hat\beta_1 - 0}{SE(\hat \beta_1)}\]

Under the null hypothesis \(\beta_{1}=0\), \(t\) has a \(t\)-distribution with \((n-2)\) degrees of freedom. We can then compute the probability that \(T\sim t_{n-2}\) is at least as large as \(|t|\) in absolute value, which is termed the \(p\)-value. Roughly speaking, a small \(p\)-value means the observed data would be unlikely if \(\beta_{1}=0\), so we reject the null hypothesis.

\(SE(\hat\beta)\), the \(t\)-value and the \(p\)-value are all included in the summary output.
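For instance, the slope’s t-value and p-value can be recovered by hand from the coefficient table (a sketch, assuming the fit myfit0 as before):

```r
coefs   <- summary(myfit0)$coefficients   # Estimate, Std. Error, t value, Pr(>|t|)
t_slope <- coefs["payroll", "Estimate"] / coefs["payroll", "Std. Error"]
p_slope <- 2 * pt(abs(t_slope), df = myfit0$df.residual, lower.tail = FALSE)
```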

## 
## Call:
## lm(formula = win ~ payroll, data = datapay)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.03932 -0.01356 -0.00134  0.00748  0.06660 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.4175     0.0148   28.23  < 2e-16 ***
## payroll       0.0621     0.0106    5.88  2.5e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0256 on 28 degrees of freedom
## Multiple R-squared:  0.553,  Adjusted R-squared:  0.537 
## F-statistic: 34.6 on 1 and 28 DF,  p-value: 2.52e-06

The confint() function returns the confidence intervals for us (a 95% confidence interval by default).
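The two sets of intervals below come from calls along these lines:

```r
confint(myfit0)                 # 95% intervals (the default level)
confint(myfit0, level = 0.99)   # 99% intervals
```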

##              2.5 % 97.5 %
## (Intercept) 0.3872 0.4478
## payroll     0.0405 0.0838
##              0.5 % 99.5 %
## (Intercept) 0.3766 0.4583
## payroll     0.0329 0.0913

2.3.3 Confidence interval for the mean response

We use a confidence interval to quantify the uncertainty surrounding the mean response (win). For example, for teams like the Oakland A’s whose payroll = .841, a 95% confidence interval for the mean response win is

\[\hat{y}_{|x=.841} \pm t_{(\alpha/2, n-2)} \times \sqrt{MSE \times \left(\frac{1}{n} + \frac{(.841-\bar{x})^2}{\sum(x_i-\bar{x})^2}\right)}.\]

The predict() function provides the fitted value together with a confidence interval when given the argument interval="confidence".
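The call producing the output below is, roughly:

```r
new_pay <- data.frame(payroll = 0.841)   # an Oakland-like payroll
predict(myfit0, new_pay, interval = "confidence", se.fit = TRUE)
```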

## $fit
##    fit   lwr   upr
## 1 0.47 0.455 0.484
## 
## $se.fit
## [1] 0.00695
## 
## $df
## [1] 28
## 
## $residual.scale
## [1] 0.0256

Because \[\textbf{Var}(\hat{y}_{|x}) = \sigma^{2} \Bigg(\frac{1}{n}+\frac{(x-\bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^{2}}\Bigg).\]

We can show the confidence interval for the mean response using ggplot() with geom_smooth(se = TRUE).

## `geom_smooth()` using formula 'y ~ x'

2.3.4 Prediction interval for a response

A prediction interval can be used to quantify the uncertainty surrounding win for a particular team.

\[\hat{y}_{|x} \pm t_{(\alpha/2, n-2)} \times \sqrt{MSE \times \left( 1+\frac{1}{n} + \frac{(x-\bar{x})^2}{\sum(x_i-\bar{x})^2}\right)}\]

We now produce 95% & 99% PI for a future \(y\) given \(x=.841\) using predict() again but with the argument interval="prediction".
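The two outputs below correspond to calls along these lines:

```r
new_pay <- data.frame(payroll = 0.841)
predict(myfit0, new_pay, interval = "prediction", se.fit = TRUE)                 # 95% PI
predict(myfit0, new_pay, interval = "prediction", level = 0.99, se.fit = TRUE)   # 99% PI
```

Note the extra "1 +" inside the square root of the PI formula: a prediction interval is always wider than the confidence interval at the same \(x\).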

## $fit
##    fit   lwr   upr
## 1 0.47 0.415 0.524
## 
## $se.fit
## [1] 0.00695
## 
## $df
## [1] 28
## 
## $residual.scale
## [1] 0.0256
## 
## $fit
##    fit   lwr   upr
## 1 0.47 0.396 0.543
## 
## $se.fit
## [1] 0.00695
## 
## $df
## [1] 28
## 
## $residual.scale
## [1] 0.0256

Now we plot the confidence interval (shaded) along with the 95% prediction interval in blue and 99% prediction interval in green.

## Warning in predict.lm(myfit0, interval = "prediction"): predictions on current data refer to _future_ responses
## Warning in predict.lm(myfit0, interval = "prediction", level = 0.99): predictions on current data refer to _future_ responses
## `geom_smooth()` using formula 'y ~ x'

From the output above, the 95% prediction interval runs from \(.415\) to \(.524\) for a team like the Oakland A’s, yet their actual win is \(.539\). So Oakland is somewhat unusual, but not extremely so: the 99% prediction interval does contain the true Oakland winning percentage of 0.539.

Read page 82 of ISLR to fully understand the difference between confidence and prediction intervals.

2.4 Model diagnoses

How reliable are our confidence intervals and tests? We need to check the model assumptions in the following order:

  1. Check linearity first; if linearity is satisfied, then

  2. Check homoscedasticity; if homoscedasticity is satisfied, then

  3. Check normality.

2.4.1 Residual plot

We plot the residuals against the fitted values to

  • check linearity by checking whether the residuals follow a symmetric pattern with respect to \(h=0\).

  • check homoscedasticity by checking whether the residuals are evenly distributed within a band.
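In base R, such a residual plot can be drawn as follows (a sketch, assuming the fit myfit0):

```r
plot(fitted(myfit0), residuals(myfit0),
     pch  = 16,
     xlab = "Fitted values",
     ylab = "Residuals",
     main = "Residuals vs. fitted")
abline(h = 0, lwd = 2, col = "red")   # symmetry around h = 0 supports linearity
```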

2.4.2 Check normality

We look at the qqplot of residuals to check normality.
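A minimal version with base R:

```r
qqnorm(residuals(myfit0))                        # points near the line support normality
qqline(residuals(myfit0), col = "red", lwd = 2)  # reference line through the quartiles
```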

3 Summary

  • EDA: We understand the data by exploring its basic structure (number of observations, variables and missing data), the descriptive statistics and the relationships between variables. Visualization is a crucial step in EDA. Both the graphical tools in base R and ggplot2 come in handy.
  • OLS: Simple linear regression is introduced. We study the OLS estimate with its interpretation and properties. We evaluate the OLS estimate and provide inference. It is important to perform model diagnoses before coming to any conclusion. The lm() function is one of the most important tools for statisticians.

4 Appendices

We have put a few topics here. Some of the sections might be covered in the class.

4.1 Appendix 1: Reverse Regression

Now that we understand the regression method better: if one wants to predict payroll using win as the predictor, can we simply solve the LS equation above for payroll?

The answer is NO. Why not?

We first plot our original model:

\[y_{win} = 0.417 + 0.062 \cdot x_{payroll}\]

Now, we reverse the regression and look at the summary output

## 
## Call:
## lm(formula = payroll ~ win, data = datapay)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.7891 -0.1967 -0.0326  0.1832  0.6949 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -3.119      0.758   -4.11  0.00031 ***
## win            8.893      1.512    5.88  2.5e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.306 on 28 degrees of freedom
## Multiple R-squared:  0.553,  Adjusted R-squared:  0.537 
## F-statistic: 34.6 on 1 and 28 DF,  p-value: 2.52e-06

We can now overlay our two regressions:

\[y_{payroll} = -3.119 + 8.893 \cdot x_{win}\]

\[y_{win} = 0.417 + 0.062 \cdot x_{payroll}\] Notice that the second line is not what you get by solving the first equation for \(x_{win}\): \[ x_{win} = \frac{1}{8.893}\, y_{payroll} + \frac{3.119}{8.893} = 0.351 + 0.112\, y_{payroll}. \]

Conclusion: the two lines are not the same!

We may also want to get the LS equation w/o Oakland first.

## 
## Call:
## lm(formula = win ~ payroll, data = subdata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.03685 -0.01097  0.00147  0.01169  0.05281 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.40800    0.01333   30.61  < 2e-16 ***
## payroll      0.06749    0.00943    7.16  1.1e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0225 on 27 degrees of freedom
## Multiple R-squared:  0.655,  Adjusted R-squared:  0.642 
## F-statistic: 51.3 on 1 and 27 DF,  p-value: 1.07e-07

The plot below shows the effect of Oakland on our linear model.

We may also want to check the effect of the Yankees (payroll = 2.70, win = .58).

4.3 Appendix 3: Investigate R-Squared

4.4 Appendix 4: More on Model Diagnoses

What do we expect to see even if all the model assumptions are met?

  1. Variability of the LS estimates \(\hat\beta\)’s
  2. Variability of the \(R^{2}\)’s
  3. Variability of the \(\hat \sigma^{2}\)’s
  4. Model diagnoses: through residuals

We demonstrate this through a simulation.

Here is a case where all the linear model assumptions are met. Once again, everything can be checked by examining the residual plots.

  1. Randomize \(x\) once.

  2. Generate a new \(y\) each time to see the variability of the LS estimates. The true model is

\[y = 1 +2x + \mathcal{N}(0,2^2) \quad i.e. \quad \beta_0 = 1 ,\beta_1 = 2, \sigma^2 = 4.\]

The theory says that

\[\hat\beta_1 \sim \mathcal{N} (\beta_1 = 2,\, \textbf{Var}(\hat\beta_1)).\]
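A self-contained sketch of such a simulation (the sample size, range of \(x\), and number of repetitions are arbitrary choices here):

```r
set.seed(471)
n <- 40
x <- runif(n, 0, 5)                      # randomize x once and keep it fixed

b1_hat <- replicate(1000, {
  y <- 1 + 2 * x + rnorm(n, sd = 2)      # new errors each time: N(0, 2^2)
  coef(lm(y ~ x))[["x"]]                 # record the fitted slope
})

mean(b1_hat)                             # centered near the true slope 2
sd(b1_hat)                               # near 2 / sqrt(sum((x - mean(x))^2))
```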

Plots

4.5 Appendix 5: Sample Statistics

We remind the reader of the definitions of the sample statistics here.

  • Sample mean:

\[\bar{y} = \frac{1}{n}\sum_{i=1}^n y_i\]

  • Sample variance:

\[s^2 = \frac{\sum_{i=1}^{n}(y_i - \bar{y})^2}{n-1}\]

  • Sample Standard Deviation:

\[s = \sqrt\frac{\sum_{i=1}^{n}(y_i - \bar{y})^2} {n - 1}\]

  • Sample correlation

\[r = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2 \sum_{i=1}^n (y_i - \bar{y})^2}} = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{s_x s_y} \]
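Each of these has a built-in counterpart in R; a quick check on toy vectors:

```r
x <- c(1, 2, 3, 5)
y <- c(2, 4, 6, 8)

mean(y)      # sample mean
var(y)       # sample variance (divides by n - 1)
sd(y)        # sample standard deviation, sqrt(var(y))
cor(x, y)    # sample correlation
```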